Indexing methods for web archives
Abstract
Numerous recent efforts to digitize previously published content and to preserve born-digital content have led to the widespread growth of large text repositories. Web archives are one such class of continuously growing text collections; they contain versions of documents spanning long time periods. Web archives present many opportunities for historical, cultural, and political analyses. Consequently, there is a growing need for tools that can efficiently access and search them. In this work, we are interested in indexing methods that support text-search workloads over web archives, such as time-travel queries and phrase queries. To this end, we make the following contributions:
• Time-travel queries are keyword queries with a temporal predicate, e.g., “mpii saarland” @ [06/2009], which return versions of documents in the past. We introduce a novel index organization strategy, called index sharding, for efficiently supporting time-travel queries without incurring additional index-size blowup. We also propose index-maintenance approaches which scale to such continuously growing collections.
• We develop query-optimization techniques for time-travel queries, called partition selection, which maximize recall at any given query-execution stage.
• We propose indexing methods to support phrase queries, e.g., “to be or not to be that is the question”. We index multi-word sequences and devise novel query-optimization methods over the indexed sequences to efficiently answer phrase queries.
We demonstrate the superior performance of our approaches over existing methods through extensive experimentation on real-world web archives.
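To make the notion of a time-travel query concrete, here is a minimal Python sketch of the query semantics only. It is not the index-sharding or partition-selection method described in the abstract; it assumes a plain in-memory inverted index in which every posting carries the validity interval of a document version. The class name `TimeTravelIndex`, its methods, and the sample documents are hypothetical and exist only for illustration.

```python
from collections import defaultdict
from datetime import date


class TimeTravelIndex:
    """Toy inverted index; each posting carries a version's validity interval.

    Illustrative assumption only: this is NOT the index-sharding scheme
    from the abstract, just the query semantics it has to support.
    """

    def __init__(self):
        # term -> list of (doc_id, valid_from, valid_to), both ends inclusive
        self.postings = defaultdict(list)

    def add_version(self, doc_id, text, valid_from, valid_to):
        # Index each distinct term of this document version once.
        for term in set(text.lower().split()):
            self.postings[term].append((doc_id, valid_from, valid_to))

    def query(self, keywords, q_from, q_to):
        """Return versions that contain all keywords and overlap [q_from, q_to]."""
        result = None
        for term in keywords.lower().split():
            hits = {
                (doc_id, begin, end)
                for doc_id, begin, end in self.postings.get(term, [])
                if begin <= q_to and q_from <= end  # interval-overlap test
            }
            result = hits if result is None else result & hits
        return sorted(result or [])


# Made-up sample versions (hypothetical data, not from the paper).
index = TimeTravelIndex()
index.add_version("mpii-home", "mpii saarland informatics",
                  date(2008, 1, 1), date(2009, 12, 31))
index.add_version("mpii-home", "mpii saarbruecken informatics",
                  date(2010, 1, 1), date(2011, 12, 31))
index.add_version("uni-home", "saarland university",
                  date(2009, 1, 1), date(2009, 3, 31))

# "mpii saarland" @ [06/2009]
hits = index.query("mpii saarland", date(2009, 6, 1), date(2009, 6, 30))
print(hits)
```

In this sketch every query term scans its full posting list and filters by time afterwards, so query cost grows with the whole history of a term. That wasted work on versions outside the query interval is the kind of overhead a time-aware index organization aims to avoid.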
Similar resources
A survey of the indexing status of the country's approved Latin medical science journals in reputable international indexes
Background and Aim: Today, journals are one of the main platforms for exchanging information between researchers. This study aimed to assess the indexing status of approved Latin journals in the field of medical science in the Web of Science and Scopus citation databases. Materials and Methods: This study was a cross-sectional descriptive survey. The statistical population of the study was 83 titles ...
Archiving, Indexing and Accessing Web Materials: Solutions for large amounts of data
The archiving of Internet materials presents two major challenges: the dynamic nature of the content and the massive number of individual web pages. These two characteristics impact the choice of methods for storing, indexing and providing fast access to archives of materials retrieved from periodic web crawling activities. At the San Diego Supercomputer Center, we have applied two different te...
An experimental study of an audio indexing system for the web
We have developed a speech recognition based audio search engine for indexing spoken documents found on the World Wide Web. Our site (http://www.compaq.com/speechbot) indexes around 20 news and talk radio shows covering a wide range of topics, speaking styles and acoustic conditions from a selection of public Web sites with multimedia archives. In this paper, we describe our system and its perf...
Semantics for Multimedia on the Web
The vision of the Semantic Web entails that large amounts of multimedia data should be annotated with semantic metadata. Current technology for content-based image interpretation is too limited for automated annotation of visual material. Techniques used by image search engines are also very poor and are unlikely to be improved in the near future. So, human annotations are required to make lar...
Hidden Web Indexing Using HDDI Framework
There are various methods of indexing the hidden web database, such as novel indexing, distributed indexing, or indexing using the MapReduce framework. Our goal is to find an optimized indexing technique, keeping in mind various factors such as searching, distributed databases, updating of the web, etc. Here, we propose an optimized method for indexing the hidden web database. This research uses Hierarchical...
Publication date: 2013